Objective:
To categorize the countries using socio-economic and health factors that determine the overall development of the country.
Problem Statement:
HELP International has raised around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Your job as a data scientist is therefore to categorize the countries using socio-economic and health factors that determine the overall development of a country, and then to suggest the countries on which the CEO should focus the most.
Context:
HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.
R packages used: e1071, tidyverse, plotly, htmltools, devtools, caret, NbClust, reshape2, rvest, magrittr, stringr, cowplot, ggmap
DATA DICTIONARY
* country: Name of the country
* child_mort: Death of children under 5 years of age per 1000 live births
* exports: Exports of goods and services per capita, given as a percentage of the GDP per capita
* health: Total health spending per capita, given as a percentage of the GDP per capita
* imports: Imports of goods and services per capita, given as a percentage of the GDP per capita
* income: Net income per person
* inflation: The measurement of the annual growth rate of the Total GDP
* life_expec: The average number of years a newborn child would live if the current mortality patterns were to remain the same
* total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same.
* gdpp: The GDP per capita. Calculated as the Total GDP divided by the total population.
Peek at the First Six Rows
## # A tibble: 6 × 10
## country child_mort exports health imports income inflation life_expec
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 90.2 10 7.58 44.9 1610 9.44 56.2
## 2 Albania 16.6 28 6.55 48.6 9930 4.49 76.3
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.1 76.5
## 4 Angola 119 62.3 2.85 42.9 5900 22.4 60.1
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8
## 6 Argentina 14.5 18.9 8.1 16 18700 20.9 75.8
## # … with 2 more variables: total_fer <dbl>, gdpp <dbl>
Data Dimensions
## Shape: 167 10
## Columns: country child_mort exports health imports income inflation life_expec total_fer gdpp
## Country Labels: 'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda' ...
…
## Total Missing Values: 0
…
No chr variables to convert to factor
## tibble [167 × 9] (S3: tbl_df/tbl/data.frame)
## $ child_mort: num [1:167] 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num [1:167] 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num [1:167] 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num [1:167] 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : num [1:167] 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num [1:167] 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num [1:167] 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num [1:167] 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : num [1:167] 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
…
We visualized the pairwise correlations between the variables to see how strongly they are related. This gives a good indication of which variables are likely to drive how the clusters partition the data: GDP per capita, life expectancy, and income.
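The correlation step can be sketched in base R as follows. This is only an illustration: it uses the first six values of five columns copied from the output shown earlier, so the coefficients will differ from those computed on all 167 countries, and the name `preview_rows` is hypothetical.

```r
# First six values of five columns, copied from the data preview above.
# Six rows are far too few for reliable estimates; this only shows the mechanics.
preview_rows <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  total_fer  = c(5.82, 1.65, 2.89, 6.16, 2.13, 2.37),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(preview_rows, method = "pearson")
print(round(cor_matrix, 2))
```

On the full dataset, the same matrix could be passed to a heatmap function to reproduce a correlation plot.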
…
## # A tibble: 6 × 9
## child_mort exports health imports income inflation life_expec total_fer
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.426 0.0495 0.359 0.258 0.00805 0.126 0.475 0.737
## 2 0.0682 0.140 0.295 0.279 0.0749 0.0804 0.872 0.0789
## 3 0.120 0.192 0.147 0.180 0.0988 0.188 0.876 0.274
## 4 0.567 0.311 0.0646 0.246 0.0425 0.246 0.552 0.790
## 5 0.0375 0.227 0.262 0.338 0.149 0.0522 0.882 0.155
## 6 0.0579 0.0940 0.391 0.0916 0.145 0.232 0.862 0.192
## # … with 1 more variable: gdpp <dbl>
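All values in the table above lie between 0 and 1, which is consistent with min-max scaling, although the scaling step itself is not shown. A minimal sketch of that transformation, assuming min-max scaling is what was applied, could look like this (`min_max`, `sample_rows`, and `scaled_rows` are hypothetical names, and the six sample rows will not reproduce the values above because the minimum and maximum are taken over the sample only):

```r
# Min-max scaling: maps a numeric vector onto the interval [0, 1].
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))  # guard against division by zero for constant columns
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# First six values of three columns, copied from the data preview.
sample_rows <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8)
)

# Apply the scaling column by column.
scaled_rows <- as.data.frame(lapply(sample_rows, min_max))
print(round(scaled_rows, 3))
```

Scaling matters for K-means because the algorithm relies on Euclidean distance, so an unscaled column such as income would otherwise dominate the smaller-valued columns.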
We decided to use K-means clustering because our goal is to categorize the countries using socio-economic and health factors that determine the overall development of the country, and the dataset has no labels to learn from.
country_kmeans <- kmeans(
  countries,
  centers = 2,
  algorithm = "Lloyd",
  iter.max = 30
)
Evaluate Cluster Quality
## Variance Explained: 0.393483
We fitted a K-means model with 2 centers, but the variance explained was low at 0.3935. Before tuning the number of clusters, we first visualize the model to see how it partitions the countries.
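The evaluation code is not shown, so the following is an assumption: "variance explained" is read here as the ratio of the between-cluster sum of squares to the total sum of squares, both of which `kmeans()` returns. The sketch uses the built-in `iris` data as a stand-in for the scaled `countries` table, and the names `demo_data`, `demo_kmeans`, and `variance_explained` are hypothetical.

```r
set.seed(42)  # K-means starts from random centers; fix the seed for repeatability

# Stand-in data: the four numeric columns of the built-in iris data set,
# standardized. This is NOT the countries data.
demo_data <- scale(iris[, 1:4])

demo_kmeans <- kmeans(
  demo_data,
  centers = 2,
  algorithm = "Lloyd",
  iter.max = 30
)

# Share of the total sum of squares accounted for by separation between clusters.
variance_explained <- demo_kmeans$betweenss / demo_kmeans$totss
cat("Variance Explained:", round(variance_explained, 6), "\n")
```

A value close to 1 means most of the spread in the data lies between clusters rather than within them; the ratio always rises as more centers are added, so it should be read alongside a method for choosing the number of clusters.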
…
Load Map Data
## long lat group order region subregion
## 1 -69.89912 12.45200 1 1 Aruba <NA>
## 2 -69.89571 12.42300 1 2 Aruba <NA>
## 3 -69.94219 12.43853 1 3 Aruba <NA>
## 4 -70.00415 12.50049 1 4 Aruba <NA>
## 5 -70.06612 12.54697 1 5 Aruba <NA>
## 6 -70.05088 12.59707 1 6 Aruba <NA>
Visualize Socio-Economic Clusters
The two socio-economic clusters are clearly separated on the map. However, with only 2 clusters the result is too coarse: the second cluster contains only countries from Africa and some from Asia. We wanted to narrow this down further to get a better idea of which countries are in the direst need of aid.
…
Elbow Method
NbClust Method
We used several metrics to determine the best number of clusters. Both the elbow chart and NbClust suggested 3 as the best number of clusters, so we retrain the model with 3 centers.
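The elbow method mentioned above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used for the chart: the built-in `iris` data stands in for the scaled `countries` table, and `demo_data`, `k_values`, `wss`, and `elbow_table` are hypothetical names.

```r
set.seed(42)  # fix the random starts so the run is repeatable

# Stand-in data: standardized numeric columns of the built-in iris data set.
demo_data <- scale(iris[, 1:4])

# Fit K-means for a range of k and record the total within-cluster sum of squares.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(demo_data, centers = k, nstart = 10, iter.max = 30)$tot.withinss
})

# The "elbow" is the k after which the drop in wss levels off.
elbow_table <- data.frame(
  k    = k_values,
  wss  = round(wss, 2),
  drop = c(NA, round(-diff(wss), 2))
)
print(elbow_table)
```

Plotting `wss` against `k_values`, for example with `plot(k_values, wss, type = "b")`, gives the elbow chart; the `drop` column shows the same information numerically.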
…
final_kmeans <- kmeans(
  countries,
  centers = 3,
  algorithm = "Lloyd",
  iter.max = 30
)
Evaluate Cluster Quality
## Variance Explained: 0.547986
Variance explained has increased to 0.5480, which is considerably better than the previous model (0.3935). We visualize the resulting clusters to see whether more tuning is needed.
Visualize Socio-Economic Clusters
The graph now gives more detail on the countries' socio-economic status. However, the group of countries with the lowest socio-economic status remains largely the same (the majority are in Africa, with some in Asia). We therefore concluded that the model output would not improve further with the current dataset and decided to use this as our final model. We visualize the model in 3D to determine the 10 countries that need aid the most.
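A visual inspection could be complemented by a simple tabular ranking. The sketch below is a hypothetical alternative, not the selection procedure used here: it averages the ranks of income, life expectancy, and GDP per capita, using only the six countries from the data preview. In practice the ranking would be restricted to the countries in the lowest-status cluster. The names `preview`, `need_rank`, and `neediest` are assumptions.

```r
# Six countries and three indicators, copied from the data preview.
preview <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria", "Angola",
                 "Antigua and Barbuda", "Argentina"),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300),
  stringsAsFactors = FALSE
)

# Rank each indicator so that 1 = worst off, then average the three ranks.
preview$need_rank <- rowMeans(cbind(
  rank(preview$income),
  rank(preview$life_expec),
  rank(preview$gdpp)
))

# Countries ordered from most to least in need under this hypothetical score.
neediest <- preview[order(preview$need_rank), c("country", "need_rank")]
print(head(neediest, 3))
```

Averaging ranks weights the three indicators equally; a different weighting would be a judgment call for the NGO.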
…
In conclusion, the 10 countries we selected as being in the direst need of aid are: Haiti, Central African Republic, Lesotho, Malawi, Zambia, Mozambique, Sierra Leone, Guinea-Bissau, Afghanistan, and Uganda. These countries have the lowest net income per person, life expectancy, and GDP per capita.
In building the model, we treated net income per person, life expectancy, and GDP per capita as the vital variables for deciding the socio-economic status of a country. However, other variables might be considered more important. Building another model on the output of our clustering might also give a more accurate outcome.